Finding Romanized Arabic Dialect in Code-Mixed Tweets

Authors

  • Clare R. Voss
  • Stephen Tratz
  • Jamal Laoudi
  • Douglas M. Briesch
Abstract

Recent computational work on Arabic dialect identification has focused primarily on building and annotating corpora written in Arabic script. Arabic dialects, however, also appear in Roman script, especially in social media. This paper describes our recent work developing tweet corpora and a token-level classifier that identifies a romanized Arabic dialect and distinguishes it from French and English in tweets. We focus on Moroccan Darija, one of several spoken vernaculars in the family of Maghrebi Arabic dialects. Even on noisy, code-mixed tweets, the classifier achieved token-level recall of 93.2% on romanized Arabic dialect, 83.2% on English, and 90.1% on French. The classifier, now integrated into our tweet conversation annotation tool (Tratz et al. 2013), has semi-automated the construction of a romanized Arabic-dialect lexicon. Two datasets extracted from our tweet corpus collection will be made available in the LRE MAP: a full list of Moroccan Darija surface token forms, and a table of lexical entries derived from this list, with spelling variants.
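The token-level recall figures quoted in the abstract are per-language scores: for each language, the share of tokens with that gold label that the classifier labelled correctly. The following is a minimal sketch of that computation, not the authors' evaluation code. The function name `per_class_recall`, the label strings, and the six example tokens are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def per_class_recall(gold, predicted):
    """Return {label: recall} for aligned token-level gold and predicted labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    # Number of tokens carrying each gold label.
    total = Counter(gold)
    # Number of tokens per gold label that were predicted correctly.
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {label: correct[label] / total[label] for label in total}


# Hypothetical labels for six tokens; these are NOT data from the paper.
gold = ["darija", "darija", "french", "english", "darija", "french"]
pred = ["darija", "french", "french", "english", "darija", "english"]

recall = per_class_recall(gold, pred)
for label, value in sorted(recall.items()):
    print(f"{label}: {value:.3f}")
```

Reporting recall separately for each language, rather than a single accuracy figure, keeps a frequent language in a code-mixed corpus from masking weak performance on a rarer one.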

Similar resources

Tweet Conversation Annotation Tool with a Focus on an Arabic Dialect, Moroccan Darija

This paper presents the DATOOL, a graphical tool for annotating conversations consisting of short messages (i.e., tweets), and the results we obtain in using it to annotate tweets for Darija, an historically unwritten Arabic dialect spoken by millions but not taught in schools and lacking standardization and linguistic resources. With the DATOOL, a native-Darija speaker annotated hundreds of mi...

Borrowing the Verb “ast” and Its Varieties in Arabic Dialect of Sarab

“Borrowing” is a linguistic process studied in diachronic linguistics, in which one language takes elements from another. It usually occurs in areas where two languages come into contact. Such borrowing occurs in a dialect spoken in South Khorasan. The Arabs living in this part of Iran probably immigrated in the early centuries of Islam. In thi...

The Status of [h] and [ʔ] in the Sistani Dialect of Miyankangi

The purpose of this article is to determine the phonemic status of [h] and [ʔ] in the Sistani dialect of Miyankangi. Auditory tests applied to the relevant data show that [ʔ] occurs mainly in word-initial position, where it stands in free variation with Ø. The only place where [h] is heard is in Arabic and Persian loanwords, and only in the pronunciation of some speakers who are educated and/or...

Preprocessing Egyptian Dialect Tweets for Sentiment Mining

Research on Arabic sentiment analysis is still very limited and in its early stages compared with other languages such as English, whether at the document level or the sentence level. In this paper, we test the effect of preprocessing (normalization, stemming, and stop-word removal) on the performance of an Arabic sentiment analysis system using Arabic tweets from Twitter. The sentiment (positiv...

Revisiting Automatic Transliteration Problem for Code-Mixed Romanized Indian Social Media Text

Although automatic transliteration for Indian languages is a well-studied paradigm, available transliteration techniques fail in the Indian social media context due to phenomena such as wordplay, creative spelling, code-mixing, and phonetic romanized typing, all of which imply that transliteration for Indian social media text has to be revisited. The paper reports an initial study on automatic ...

Publication date: 2014